Maria Bochenek
In this assignment we will again use the QSAR-biodegradation dataset LINK for a binary classification task. The dataset has 17 features and 1055 rows. We changed the class labels from $\{1, 2\}$ to $\{0, 1\}$ for consistency with the procedure in HW1 (it was necessary to use BCELoss while training the Neural Network Classifier). Class 0 has $699$ instances and class 1 has $356$, so the dataset's imbalance ratio equals $1.96$. The task is to distinguish ready (class 1) from not ready (class 0) biodegradable molecules based on molecular structure descriptors.
We selected sklearn.ensemble.GradientBoostingClassifier, which was one of the better-performing models in the analysis from HW1. To extend the scope of our analysis we also included sklearn.linear_model.LogisticRegression as an example of a model with worse performance, for comparison.
We set a random seed for reproducibility and split the dataset into training and validation sets 80:20. We performed feature scaling using sklearn.preprocessing.StandardScaler. Model performance is presented in the table below.
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.85782 | 0.92568 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
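As a quick sanity check on the reported metrics, f1 should be the harmonic mean of precision and recall; for the GradientBoostingClassifier row:

```python
# f1 is the harmonic mean of precision and recall; checking it against
# the reported GradientBoostingClassifier metrics from the table above.
recall, precision = 0.898305, 0.688312
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.779412
```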
Next, we sampled 2 observations from the test dataset, one from each class. The selected observations (with standardized feature values) are presented below.
| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 879 | 0.00285906 | -0.451687 | 0.430836 | -1.56032 | 0.0322026 | 0.699801 | 0.0801493 | 0.29255 | -0.14583 | -0.376155 | 0.0615274 | -0.00880728 | 1.07368 | 0.254992 | -0.26587 | 0.648406 | -0.129319 | 1 |
| 1002 | 1.41747 | 1.87729 | -0.39487 | 0.264144 | 0.920539 | -0.26365 | 1.29818 | 0.630873 | 0.472339 | -0.244208 | 1.45952 | 0.00355073 | 5.032 | -0.599315 | 0.158288 | 1.10398 | 0.581106 | 0 |
The predictions for the selected observations are presented below.
| | 879 | 1002 |
|---|---|---|
| TARGET | 1 | 0 |
| GradientBoostingClassifier | 1 | 0 |
| LogisticRegression | 1 | 0 |
We can see that the SHAP decomposition for the selected observations differs depending on which package is used. This is probably because the Shapley values are not computed exactly but are approximated (e.g. by sampling orderings of variables).
However, it is important to note that only the three variables with the highest attribution for observation $879$ were the same (including the order) for both the Dalex and shap packages; for observation $1002$ it was four variables. Additionally, both packages agreed on the 7th, 8th, and 9th most important variables for observation $1002$. Moreover, for observation $1002$ both implementations state that the values of variables $V22$ and $V37$ have positive attribution toward the prediction being class 1, which is the incorrect class. (Ad Task A6)
The selected observations have different sets of top 3 variables with the highest importance. For observation $879$ (belonging to class 1) the top 3 variables with the highest absolute attribution are, in order: $V12$, $V22$, and $V1$, while for observation $1002$ (belonging to class 0) they are: $V36$, $V1$, and $V27$ (Ad Task A4).
Even though $V1$ has high importance for both observations, its attribution differs: positive for $879$ and negative for $1002$ (Ad Task A5). Also, $V1$ ranks higher (second) for observation $1002$ than for observation $879$ (third).
We compare sklearn.ensemble.GradientBoostingClassifier with sklearn.linear_model.LogisticRegression using the dalex package.
Decompositions of predictions for models are presented below (Ad Task A7).
The decompositions of predictions for observations $879$ and $1002$ differ between the Gradient Boosting and Logistic Regression classifiers. Interestingly, only the second most important variable is the same for both models ($V22$ for observation $879$ and $V1$ for $1002$).
The variables present in the top 9 most important variables for both models (but in different positions) are
Moreover, for both models $V22$ and $V37$ have positive attribution for observation $1002$, while $V31$ has positive attribution under Logistic Regression and negative under Gradient Boosting for observation $879$.
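This kind of top-k agreement between two attribution vectors can be checked programmatically. A minimal sketch with hypothetical attribution values (illustrative numbers only, not the actual model outputs):

```python
def top_k_overlap(attr_a, attr_b, k=3):
    """Features appearing in the top-k (by absolute attribution) of both rankings."""
    def top(attr):
        ranked = sorted(attr.items(), key=lambda kv: abs(kv[1]), reverse=True)
        return {feature for feature, _ in ranked[:k]}
    return top(attr_a) & top(attr_b)

# Hypothetical attributions for one observation from two explainers.
dalex_attr = {"V12": 0.20, "V22": -0.15, "V1": 0.10, "V31": 0.05}
shap_attr = {"V12": 0.18, "V22": -0.12, "V1": 0.09, "V31": -0.04}
print(sorted(top_k_overlap(dalex_attr, shap_attr)))  # ['V1', 'V12', 'V22']
```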
Calculate the Shapley value for player A given the following value function:
$$v() = 0, \quad v(A) = 20, \quad v(B) = 20, \quad v(C) = 60,$$
$$v(A,B) = 60, \quad v(A,C) = 70, \quad v(B,C) = 70, \quad v(A,B,C) = 100.$$
Shapley values are defined as
$$\phi_j = \frac{1}{|P|!} \sum_{\pi \in \Pi} \left( v(S_j^\pi \cup \{j\}) - v(S_j^\pi) \right),$$
where $\Pi$ is set of all possible permutations of players $P$ while $S_j^\pi$ is a coalition before player $j$ joins in permutation $\pi$.
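This definition can be evaluated directly on the value function above by enumerating all orderings; a minimal sketch (function names are ours):

```python
from itertools import permutations
from math import factorial

# Value function from the exercise, with coalitions encoded as frozensets.
v = {frozenset(): 0, frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
     frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
     frozenset("ABC"): 100}

def shapley(j, players):
    """phi_j: average marginal contribution of j over all |P|! orderings."""
    total = 0
    for pi in permutations(players):
        S = frozenset(pi[:pi.index(j)])  # coalition before j joins in ordering pi
        total += v[S | {j}] - v[S]       # marginal contribution of j
    return total / factorial(len(players))

print(shapley("A", "ABC"))  # 25.0
```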
The formula for Shapley values can be rewritten as follows:
$$\phi_j = \sum_{S \subseteq P \setminus \{j\} } \frac{ |S|!\,(|P|-|S|-1)!}{|P|!} (v(S \cup \{j\}) - v(S)).$$
We will use this formulation to compute $\phi_A$.
We have 3 players, so $|P| = 3$. Player A can join the coalition as the first, second, or third member. When A joins first (joins the empty coalition) we have $S = ()$, so A's contribution is $\frac{0!2!}{3!}(v(A)- v())$. When A joins second, there are two possible coalitions it can join, $S=(B)$ or $S=(C)$, so A's contribution is $\frac{1!1!}{3!}(v(A, B)- v(B)) + \frac{1!1!}{3!}(v(A, C)- v(C))$. When A joins third, there is only one possible coalition it can join, $S = (B, C)$, so A's contribution is $\frac{2!0!}{3!}(v(A, B, C)- v(B, C))$.
Thus we have
$$\begin{align*} \phi_A &= \sum_{S \subseteq P \setminus \{A\} } \frac{ |S|! (|P|-|S|-1)!}{|P|!} (v(S \cup \{A\}) - v(S)) \\ &= \frac{0!2!}{3!}(v(A)- v()) + \frac{1!1!}{3!}(v(A, B)- v(B)) + \frac{1!1!}{3!}(v(A, C)- v(C)) + \frac{2!0!}{3!}(v(A, B, C)- v(B, C)) \\ &= \frac{1}{3}(20-0) + \frac{1}{6}(60 - 20) + \frac{1}{6}(70 - 60) + \frac{1}{3}(100 - 70)\\ &= \frac{20}{3} + \frac{40}{6} + \frac{10}{6} + \frac{30}{3} \\ &=\frac{20+20+5+30}{3} \\ &= \frac{75}{3} \\ &= 25 \end{align*}$$
So the Shapley value for player A equals $25$, which is the average contribution of this player while building the coalition in this coalition game.
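The subset formula used in the derivation can also be verified numerically; a minimal sketch (function names are ours):

```python
from itertools import combinations
from math import factorial

# Value function from the exercise, with coalitions encoded as frozensets.
v = {frozenset(): 0, frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
     frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
     frozenset("ABC"): 100}

def shapley_subsets(j, players):
    """phi_j via the subset formula: weighted marginal contributions over all S."""
    others = [p for p in players if p != j]
    n = len(players)
    phi = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            S = frozenset(S)
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (v[S | {j}] - v[S])
    return phi

print(round(shapley_subsets("A", "ABC"), 10))  # 25.0
```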
import dalex as dx
import xgboost
import lime
import shap
import sklearn
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.optimize as so
from itertools import cycle
from numpy import interp  # scipy.interp and scipy.diag were removed from SciPy; use NumPy equivalents
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.compose import make_column_selector, make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import (accuracy_score, auc, classification_report, confusion_matrix,
                             f1_score, make_scorer, roc_auc_score, roc_curve)
from sklearn.utils.class_weight import compute_sample_weight
import warnings
warnings.filterwarnings("ignore")
It can be one of random forest, GBM, CatBoost, XGBoost, LightGBM (various types) etc.
url = "https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/qsar-biodeg.csv"
df = pd.read_csv(url,index_col=0)
# change target labels 2->1 and 1->0
df["TARGET"] -= 1
display(df.head())
X = df.loc[:, df.columns != 'TARGET']
y = df["TARGET"]
| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.919 | 2.6909 | 31.4 | 0.000 | 3.106 | 2.550 | 9.002 | 0.960 | 1.142 | 1.201 | 1.932 | 0.011 | 0.000 | 4.489 | 2.949 | 1.591 | 7.253 | 1 |
| 1 | 4.170 | 2.1144 | 30.8 | 0.000 | 2.461 | 1.393 | 8.723 | 0.989 | 1.144 | 1.104 | 2.214 | -0.204 | 0.000 | 1.542 | 3.315 | 1.967 | 7.257 | 1 |
| 2 | 3.932 | 3.2512 | 26.7 | 0.000 | 3.279 | 2.585 | 9.110 | 1.009 | 1.152 | 1.092 | 1.942 | -0.008 | 0.000 | 4.891 | 3.076 | 2.417 | 7.601 | 1 |
| 3 | 3.000 | 2.7098 | 20.0 | 0.000 | 2.100 | 0.918 | 6.594 | 1.108 | 1.167 | 1.024 | 1.414 | 1.073 | 8.361 | 1.333 | 3.046 | 5.000 | 6.690 | 1 |
| 4 | 4.236 | 3.3944 | 29.4 | -0.271 | 3.449 | 2.753 | 9.528 | 1.004 | 1.147 | 1.137 | 1.985 | -0.002 | 10.348 | 5.588 | 3.351 | 2.405 | 8.003 | 1 |
# Splitting data
SEED = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
scaler = StandardScaler().fit(X_train)
#scaling - transforming
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# convert back to dataframes
X_train = pd.DataFrame(X_train, columns = X.columns, index = y_train.index)
X_test = pd.DataFrame(X_test, columns = X.columns, index = y_test.index)
models = {
'GradientBoostingClassifier': GradientBoostingClassifier(random_state=SEED, n_estimators=100),
'LogisticRegression': LogisticRegression(random_state=SEED, penalty='l2', solver = 'lbfgs'),
}
explainers = {}
for name, model in models.items():
model.fit(X_train, y_train)
explainer = dx.Explainer(model, X_test, y_test)
# save explainer
explainers[name] = explainer
Preparation of a new explainer is initiated
-> data : 211 rows 17 cols
-> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
-> target variable : 211 values
-> model_class : sklearn.ensemble._gb.GradientBoostingClassifier (default)
-> label : Not specified, model's class short name will be used. (default)
-> predict function : <function yhat_proba_default at 0x0000017CC6553A60> will be used (default)
-> predict function : Accepts pandas.DataFrame and numpy.ndarray.
-> predicted values : min = 0.00988, mean = 0.358, max = 0.983
-> model type : classification will be used (default)
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.926, mean = -0.0782, max = 0.934
-> model_info : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
-> data : 211 rows 17 cols
-> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
-> target variable : 211 values
-> model_class : sklearn.linear_model._logistic.LogisticRegression (default)
-> label : Not specified, model's class short name will be used. (default)
-> predict function : <function yhat_proba_default at 0x0000017CC6553A60> will be used (default)
-> predict function : Accepts pandas.DataFrame and numpy.ndarray.
-> predicted values : min = 9.77e-11, mean = 0.348, max = 0.98
-> model type : classification will be used (default)
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.968, mean = -0.0685, max = 0.994
-> model_info : package sklearn
A new explainer has been created!
performance = pd.concat([explainer.model_performance().result for explainer in explainers.values()], axis=0)
performance
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.857820 | 0.925680 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
np.random.seed(SEED)
y_test_0 = y_test[y_test == 0]
y_test_1 = y_test[y_test == 1]
idx_0 = np.random.choice(y_test_0.index, 1)
idx_1 = np.random.choice(y_test_1.index, 1)
idx = np.concatenate((idx_1, idx_0))
obs_var = X_test.loc[idx]
obs_target = y_test[idx].to_frame()
obs_df = obs_var.join(obs_target)
display(obs_df)
| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 879 | 0.002859 | -0.451687 | 0.430836 | -1.560322 | 0.032203 | 0.699801 | 0.080149 | 0.292550 | -0.145830 | -0.376155 | 0.061527 | -0.008807 | 1.073682 | 0.254992 | -0.265870 | 0.648406 | -0.129319 | 1 |
| 1002 | 1.417468 | 1.877287 | -0.394870 | 0.264144 | 0.920539 | -0.263650 | 1.298176 | 0.630873 | 0.472339 | -0.244208 | 1.459517 | 0.003551 | 5.032002 | -0.599315 | 0.158288 | 1.103983 | 0.581106 | 0 |
pred_df = obs_target.copy()
for name, model in models.items():
pred_df[name] = model.predict(obs_var)
display(pred_df.T)
| | 879 | 1002 |
|---|---|---|
| TARGET | 1 | 0 |
| GradientBoostingClassifier | 1 | 0 |
| LogisticRegression | 1 | 0 |
E.g. for Python: dalex and shap, for R: DALEX and iml.
explainer = explainers['GradientBoostingClassifier']
shap_attributions = [explainer.predict_parts(obs_var.loc[[i]],
type="shap",
label=f'Observation {i}')
for i in obs_var.index]
bd_attributions = [explainer.predict_parts(obs_var.loc[[i]],
type="break_down",
label=f'Observation {i}')
for i in obs_var.index]
shap_attributions[0].plot(shap_attributions[1::], title="Shapley Values for Gradient Boosting Classifier")
bd_attributions[0].plot(bd_attributions[1::], title="Break Down for Gradient Boosting Classifier")
model = models['GradientBoostingClassifier']
shap_explainer = shap.explainers.Tree(model,
data=X_test,
model_output="probability")
shap_values = shap_explainer(obs_var)
for i in range(len(shap_values)):
plt.title(f"Observation {obs_var.index[i]}", fontsize = 15)
shap.plots.waterfall(shap_values[i], show=True)
plt.close()
See above. Selected observations satisfy this condition.
See above. Selected observations satisfy this condition.
We can clearly see that the SHAP decomposition differs for the selected observations depending on which package is used. This is probably because the Shapley values are not computed exactly but rather approximated.
However, it is important to note that the three variables with the highest attribution for observation 879 were the same for both the dalex and shap packages, while it was four variables for observation 1002. Moreover, for observation 1002 both implementations state that the values of variables V22 and V37 have positive attribution toward the prediction being class 1, which is the incorrect class.
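To illustrate why sampling-based approximations can disagree between runs or implementations, here is a minimal Monte Carlo Shapley sketch on the toy game from the exercise above. This is a simplification for intuition only, not the actual algorithm of either package:

```python
import random

# Value function of the toy game from the Shapley exercise above.
v = {frozenset(): 0, frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
     frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
     frozenset("ABC"): 100}

def mc_shapley(j, players, n_samples, seed=0):
    """Estimate phi_j by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        pi = list(players)
        rng.shuffle(pi)
        S = frozenset(pi[:pi.index(j)])  # coalition before j joins
        total += v[S | {j}] - v[S]
    return total / n_samples

# Few samples give a noisy estimate; many samples approach the exact value 25.
print(mc_shapley("A", "ABC", 10))
print(mc_shapley("A", "ABC", 5000))
```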
We compare sklearn.ensemble.GradientBoostingClassifier with sklearn.linear_model.LogisticRegression using the dalex package.
explainer = explainers['LogisticRegression']
shap_attributions = [explainer.predict_parts(obs_var.loc[[i]],
type="shap",
label=f'Observation {i}')
for i in obs_var.index]
bd_attributions = [explainer.predict_parts(obs_var.loc[[i]],
type="break_down",
label=f'Observation {i}')
for i in obs_var.index]
shap_attributions[0].plot(shap_attributions[1::], title="Shapley Values for Logistic Regression")
bd_attributions[0].plot(bd_attributions[1::], title="Break Down for Logistic Regression")
shap_attributions = []
for i in obs_var.index:
for name in explainers:
shap_attributions.append(explainers[name].predict_parts(obs_var.loc[[i]],
type="shap",
label=f'Observation {i} for {name}'))
shap_attributions[0].plot(shap_attributions[1::], title="Shapley Values - model comparison")
import plotly
plotly.offline.init_notebook_mode()